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Abstract 

This paper focuses on Bayesian shrinkage for covariance matrix estimation. We 
examine posterior properties and frequentist risks of Bayesian estimators based 
on new hierarchical inverse- Wishart priors. More precisely, we give the existence 
conditions of the posterior distributions. Advantages in terms of numerical sim- 
ulations of posteriors are shown. A simulation study illustrates the performance 
of the estimation procedures under three loss functions for relevant sample sizes 
and various covariance structures. 
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1. Introduction 

Estimating a covariance matrix efficiently is an important statistical issue. 
Often, applied scientific problems require an estimate of a covariance matrix in 
the context of a large matrix dimension p relative to the number of observations 
n. In such settings, standard estimators - the sample covariance matrix or the 
maximum likelihood estimator - are known to perform poorly [26, 27, 4]. When 
n is smaller than p, they are not positive definite. When it is larger, they are 
invertible but still inappropriate because unstable, unless £ is negligibly small. 
Indeed, if n is of the same order as p, the sample eigenvalues significantly deviate 
from to the population eigenvalues [27, 4]. This fact has incited many authors 
to focus on the eigenvalues attempting to overcome their distortion. 
Some approaches to more stably estimating the matrix in small samples have 
been proposed, such as shrinkage methods on which we will focus. The work 
along these directions can be found in both frequentist and Bayesian frame- 
works. The proposed estimators for the covariance matrixare then derived from 
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a decision-theoretic perspective or associated with an appropriate prior and a 
specific loss function. 

James and Stein [14] were the first to propose biased estimators for covariance 
matrixunder Stein's loss function, dominating the classical sample covariance 
matrix. Since, many authors have explored improved James-Stein type estima- 
tors under Stein's loss [27, 5, 17, 20] or other losses [8, 12, 16, 20]. Ledoit and 
Wolf [18] consider Steinian shrinkage toward the single-index covariance matrix 
to estimate the covariance matrix of stock returns. Furrer and Bengtsson [10] 
consider "tapering" the sample covariance matrix, that is, gradually shrinking 
the off-diagonal elements toward zero. 

From a Bayesian perspective, the common approach [13, 2] yielding estima- 
tors which shrink towards a structure uses the conjugate prior on the covariance 
matrix, an inverse- Wishart distribution with some degrees of freedom and a scale 
matrix as hyperparameters. The appeal of conjugate priors is to allow efficient 
posterior simulations but such priors might be contested because of their lack of 
flexibility; specifically, with an inverse- Wishart prior, only one parameter does 
control the variability of the matrix elements. Efficient Bayesian estimators 
involving more flexible priors are obtained using various decompositions, the 
most well-known are derived from the variance-correlation strategy introduced 
by Barnard and al. [1, 3], the spectral [3, 28] or the Cholesky [25] decompo- 
sitions of the covariance matrix and the matrix-logarithmic covariance model 
[19]. In particular, in [1], the covariance matrixis modeled in terms of standard 
deviations and correlations. After discussion about suitable priors, the authors 
advocate a flat prior on the space of correlation matrices and log normal priors 
on the variances. Yang and Berger [28] develop a reference prior approach for the 
covariance matrix, approach known to outperform another common noninfor- 
mative prior, Jeffreys prior [15], for high-dimensional problems. Working with 
the spectral decomposition of the matrix, their method shrinks the eigenvalues, 
it results in a better estimation of the underlying eigenstructure. Smith and 
Kohn [25] use a prior that allows for zero elements in the strict lower triangle 
of the Cholesky decomposition of the inverse of the covariance matrix. Leonard 
and Hsu [19] model the matrix logarithm of the covariance matrix and place 
a multivariate normal distribution on the p ^ p 2 ^ 1 - ) vectorized non-redundant ele- 
ments of this matrix. Daniels and Kass [3] focus on shrinking the matrix toward 
a structure, specifically, a diagonal matrix, using a fully Bayesian approach and 
using three different priors: normal priors on the z-transform of the correla- 
tions, normal priors on the logit of the Givens angles and inverse- Wishart prior. 
These parametrizations do not have simple statistical interpretation and the 
use of these methods has been limited in view of the difficulties of computation 
implied. 

Moreover all these parametrizations do not solve the difficulty in handling and 
estimating hyperparameters. Empirical Bayes estimates of the hyperparame- 
ters can be proposed, for instance by [13, 2, 24] in the case of inverse- Wishart 
prior. An alternative to this empirical specification is to use hierarchical mod- 
eling. Multilevel modeling will ensure that uncertainty in higher level param- 
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eters propagates into inferences on lower level parameters and is supposed to 
be more flexible and also more stable than those based on diffuse priors [11]. 
Applied to the case of inverse- Wishart prior, it is intended to offer objectivity 
in terms of how close the true matrix is to the specified structure by allowing 
for data-dependent shrinkage towards this structure. To our knowledge, hier- 
archical modelization based on inverse- Wishart priors has only been proposed 
in [3] with a constraint of fixing an upper bound on the degrees of freedom to 
ensure proper posterior distributions. 

In this paper we focus on hierarchical inverse- Wishart priors on the covariance 
matrix such that shrinkage toward diagonality is involved. This approach builds 
on the works of Daniels and Kass [3]. After relaxing the prior on the degrees 
of freedom, we establish the precise conditions to ensure the properness of the 
posterior distributions. We also give a detailed experimental comparison of the 
competing Bayesian models and the maximum likelihood estimator under differ- 
ent loss functions. According to the loss, the priors don't have the same effect on 
posterior inference and we outline the limits of the hierarchical inverse- Wishart 
priors as a "default" choice for the covariance matrix estimation. 
The paper is structured as follows. Section 2 presents the Normal inverse- 
Wishart model as covariance shrinkage modeling approach. In Section 3 we 
define the three hierarchical priors that we will consider and conditions to get 
proper posterior distributions. Section 4 describes the attractive Markov Chain 
Monte Carlo sampling scheme for Bayesian computation of posteriors. Section 
5 reports numerical results for the matrix estimators and for the eigenvalues 
estimators. The last section presents some conclusions and provides recommen- 
dations for using such priors. 

2. The normal inverse- Wishart model 

Let X = (A^, X^) T be a p-dimensional random vector following a 
multivariate normal distribution A/" p (0,E). E is an unknown covariance matrix 
and belongs to the set of p x p symmetric positive definite matrices S + . Given 
an independent and identically distributed sample (Xi, ...,X„) of X, we wish 
to estimate the covariance matrix. The associated likelihood function for E is 

ms) = (Ejw e * v {~l tr (s ~ ls) } (i) 

where S = J2i=i X, is the scatter matrix. 

The maximum likelihood estimator of E, T.mle, defined by — , is a classical 
estimator of E. However, it becomes unstable when p is moderate or large 
relative to the sample size n because of the large number - P< - P 2 ^ 1 ^ - of unknown 
parameters to be estimated. When n < p, T<mle is no longer positive definite. 
Even when n > p, the matrix S is positive definite but does lead to the distortion 
of the eigenstructure [27, 4] for p close to n, especially when the true matrix 
is close to be diagonal. That motivates the choice of a Bayesian regularization 
approach and leads to assign a prior on E. 
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In the absence of reliable prior information, the selection of the prior distribution 
is quite delicate and generic solutions must be chosen instead. Here we consider 
a general class of priors. Since the model X when £ _1 varies is from a natural 
exponential family of distributions, a commonly used class of distributions for 
the canonical parameter £ _1 is the conjugate family as defined by Diaconis & 
Ylvisaker [6], called the Wishart distribution. The induced prior on £ is then 
the inverse- Wishart . 

In the notation of Eaton [7], £|/3, D ~ lW(f3, D) means that £ has the inverse- 
Wishart distribution with degrees of freedom (3 > p—1 and scale matrix D e S + . 
The density is given by 

^^) = ^ ) l^|-^ ±1 exp{-^(E- 1 J D)}. (2) 
The normalising constant turns out to be equal to 

C(/?,p) = 2^T p (Q (3) 

where r p (.) is the multivariate Gamma function defined as T p (a) = 7r P<P 4 " Y\^ =1 T (a + ^T~)- 

The restriction that j3 be greater than p — 1 is necessary so that T ( ^ + 2~ 3 ) be 

well defined. Moreover, as E(£)= , the expectation of £ will exist if and 

only if (3 > p + 1. 

Now, let (Xi, ...,X„) be a Gaussian sample associated with an inverse- Wishart 
prior on £ centered in D, the posterior mean of £, a likely estimator of £, is 
then equal to 

E(£|S) = { ^l D+ 1 S = ^y^f"* with/?>p+l. 

It shows that f3 controls the amount of shrinkage: when n is held fixed and j3 is 
allowed to grow, the posterior mean tends towards D while the estimator tends 
towards £mlb if P is held fixed and n is allowed to grow. 

When we consider the eigenvalues of the posterior mean, it is easy to see that 
the eigenvalues <?j, i = 1,..., p of E(£|5) are 

W -P- +nk 

9t = : , Vi = l,...,p. 4 

p + n — p — 1 

where are the eigenvalues of £mle- 

We can check that, for Zj < rfjj, we always have k < gt and, for k > 1, we have 

Zj > Qi. 

Notice that we can consider E(£ _1 |5) _1 as the estimator of £, you will find the 
same kind of results. 

In summary, the span of the eigenvalues of Bayes estimators based on inverse- 
Wishart prior will be smaller than the span of the eigenvalues of S, which could 
be used to correct the instability of the standard estimators. 
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3. Inverse-Wishart prior distribution: choice of the hyperparameters 

As seen in Section 2 the inverse- Wishart distribution is characterized by the 
hyperparameters /3 and D. Sometimes the hyperparameters can be specified 
by the investigators but there is rarely good scientific information on which to 
base these specifications. They can be obtained by empirical Bayes estimation 
[13, 2, 24]. A frequently-applied procedure is to set the scale matrix D equal to 
^mle and p degrees of freedom. However it turns out to be not convenient as 
soon as Emlb becomes suspect. 

Consequently, we prefer to embed a structure for D. The most commonly em- 
ployed matrix targets are the identity matrix and its scalar multiples [24] . They 
are low-dimensional, thus they impose a rather strong structure which in turn 
requires only little data to fit the hyperparameters remaining to estimate. Al- 
ternatively we can assign a further prior distribution on these hyperparameters. 
This implies hierarchical models which allow a more objective approach to infer- 
ence [11]. They are supposed to provide more flexibility than non-hierarchical 
priors. As for the degrees of freedom j3, they can either be taken as small as 
possible with the idea that large values supporting the scale matrix structure 
or they can be given their own prior distribution. 

Here we adopt a fully Bayesian approach: we investigate three inverse- Wishart 
hierarchical priors with unknown degrees of freedom and scale matrix, inducing 
shrinkage towards diagonality. 

3.1. Daniels and Kass prior 

Daniels and Kass [3] assign flat improper priors on the logarithm of the 
elements of the diagonal scale matrix and a vague proper uniform distribution 
on the logarithm of the degrees of freedom, over ]p — 1; 6], with b a large value. 
Let A be a p x p diagonal definite-positive matrix, the Daniels and Kass model, 
namely Model D&K, is defined by 



S|aj,/3 A = diag(eei, ...,a p ) 

n(aj) oc —I ]0 . +oo[ (oij) 

CXj 

tt(/3)cx *I ]p _ 1;6] (/?). 



(5) 



From (1) and (5), the joint posterior distribution given the data X is equal to 

(T A fll^ IS' 1 ) 2 exp[-§fr(S- 1 (S + ^))]W | - 1 I 

n^,A,fj\b) - pn p£ a Vx(Ri)»x]p-i ; 6[( L . a i.-.Q!f 

(2tt) 2 2 2 /3r p (§) 

(6) 

The bound b must be a finite value to keep the joint posterior distribution 
proper, with constraint to be large in order to minimize its effect on inference. 
Anyway we exclude arbitrarily some values of /3: we never know how large is 
large enough. Another similar but improper prior on the logarithm of (3 is 
preferred, rather than a proper one with bounded support. 
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3.2. A "diagonal, equal variance" model as prior mean matrix 

We choose to center the prior matrix in al p . By this way, the resulting 
estimators will shrink all elements of S. Model 1 will be defined by 

V\a,p~IW{0,(0-p-l)aI p ) 

7r(a) oc ^I]o ; +co[(a) (7) 
n(/3) oc ■^■I]p_|_i ;+(X) [(/3) , with <5 positive integer. 

After a reparametrization given by <j> = a(J3 — p — 1), we derive (8), the joint 
posterior distribution of (E,<^>, /3) given S: 

fi+n+p+l p 

. E" 1 2 exp -|ir(E- 1 (5' + ^)) 

(2tt) 2 2 2 /3»r p (§)0 

(8) 

The introduction of S permits the limit superior of j3 in Model D&K to be re- 
laxed. The problem of parameter choice persists but it is hoped that this model 
influences less the posterior distribution. 

The conditions ensuring properness of the posterior distribution are driven by 
S. For S = 1, we know by [3] that (8) is improper. 

Proposition V. The posterior distribution (8) is a proper probability density 
function for all 5 > 1: 



f + f [ ir(E,<f>,0\S)d£d<l>d0 < 00 (9) 

Jp+l J}0:+co[Js+ 



This proof is deferred to Appendix Appendix A. 

It leads to a remark about the posterior distribution of /3. 



Corollary: The posterior marginal distribution of 0, ir(/3\S), has its m-th mo- 
ment and higher ones defined if m < 5 — 1. 

Consequently, if (3 is no longer considered as a nuisance parameter but as a 
parameter of interest, samples from 7r(/3|X) must be used with caution. For 
instance, for S < 3, the mean of sample paths from 7r(/3|X) has no sense since 
the posterior mean of /3 will not exist. 

3.3. A "diagonal, unequal variance" model as prior mean matrix 

We can prefer to keep a less strong structure for the matrix target, as in 
Section 3.1, where we only shrink the off-diagonal elements of S. Moreover we 
assume that the prior mean matrix does exists, equal to an arbitrary diagonal 
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matrix A. Let A be a p x p diagonal definite-positive matrix, Model 2 will be 
defined by 

X\ctj,P ~ TW{(3, (P -p- I) A), A = diag(a l , ...,a p ) 



n(aj) oc — I] 0; +oo[(aj) 

TT(^) OC ^jI] p+ 1;+oo[(/3)- 



a; • ' • (10) 



(36 

Proceed to the change of variable $ = (/?— p— 1)A and refer the diagonal matrix 
$ by its diagonal elements, </>,, i from 1 to p. The joint posterior distribution is 
then given by 

7r(£,$,/3|S) = M h+x(Rl^x]p+l;+oo[{^,n, ---^p, P)- 

(2tt) 2 2 2 /3 5 r p (f ) 

(11) 

Proposition 2: TTie posterior distribution (11) is a proper probability density 
function for all 5 > 1: 

r+oo p p 

l\ / 7r(£,$ )y 8|S>2E#i...#pd/3 < oo (12) 

The proof of Proposition 2 is deferred to the Appendix Appendix A. Note 
that the posterior distribution (11) for Model 2 appears quasi-identical to the 
density (6) for Model D&K, although the prior hypotheses differ. The existence 
of the prior mean matrix is only taken into account in the lower bound on j3 
in the posterior distribution. The second difference is in the power of (3 in the 
denominator. 



4. Computation of Bayes estimators 

Now we can derive the Bayes estimators related to the prior models of Sec- 
tions 3.2 and 3.3, under two loss functions. We will use Markov Chain Monte 
Carlo (MCMC) simulations to estimate these posterior quantities numerically. 

4-1- Loss functions and associated estimators 

A question that naturally arises in various contexts in multivariate analysis 
and related topics is whether to estimate E or its inverse. We choose to focus 
on the estimation of S in this paper. This parameter has a natural and well 
understood interpretation in multivariate analysis and its direct estimation has 
numerous applications. 

In the following, the Bayes estimators of E are calculated with respect to two 
common loss functions: 

• the squared Frobenius loss function 

L 2 (E,E) ^tr(E-E) 2 
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• Stein's loss function 

Li(E, E) = tr(EE _1 ) - logdct(EE- 1 ) - p 

The corresponding Bayes estimators for E are, respectively, E 2 = E(S|S) and 
Ei =E(E- 1 |S)~ 1 . 

The L 2 loss corresponds to the equivalent of the squared error loss function in a 
matrix setting. Thus £2(2, E) = X^=i ^2i=i(&ij ~~ a ij) 2 ^ s a natural quadratic 
measure of distance between the true (E) and inferred covariance matrix (E). 
The L\ loss was introduced by Stein [27] to estimate the multinormal covariance 
matrix and also called entropy loss. It results from evaluating the divergence 

of Kullback-Leibler, namely J p(x)log j^fy} dx for two Gaussian distributions 

with densities p{x) and q(x) defined by covariance matrices E and E. This scale 
invariant loss function will penalize the relative estimation error of the small 
eigenvalues, as illustrated in Section 5. Various alternative losses have also been 
proposed in the literature [28]. 

Both estimators are approximated by using the MCMC sampling algorithm 
described in Section 4.2. 

4-2. MCMC algorithm for sampling posterior distributions 

Once we introduce hierarchical priors, the conjugate structure that typically 
makes Gibbs sampling so attractive is no longer ensured. Fortunately, since the 
inverse- Wishart distributions is also a conditionally-conjugate family, the full 
conditional distribution of E is still inverse- Wishart. Moreover the full condi- 
tional for the additional parameters $ and (3 can be also simulated easily. 
Sampling from the target posterior will call on iterative resampling from inverse- 
Wishart, inverse-Gamma distributions and from the posterior distribution of fj. 
The simulation of the latter is based on a Metropolis sampling scheme, described 
in Algorithm 3. 

A finite-sample distribution from the posterior distribution (8) of (E, <f), (3) from 
Model 1 can then be obtained by a systematic scan Metropolis-Hasting-within- 
Gibbs algorithm [23], described by Algorithm 1. 

Then to obtain samples from the posterior distribution (11) of (E,$,/3) from 
Model 2, you need to follow all steps of Algorithm 1 except Step 4 and instead, 
use Step Aus described in Algorithm 2. 

4-3. The Metropolis sampling method for j3 
The density of /3( fe )|E( fe ), S is such that 



7T oc cxp 

{k) = logjEW- 1 ! +log|$W| - P log2 



^C^-Jlog^-log^^) 



I ]p+ l; + 00[(/3 (fe) ) 



(13) 
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Algorithm 1 Metropolis-Hasting-within-Gibbs sampling scheme for the joint 
posterior (8) 



1: Initialization with k = and arbitrary values for </> and S 

2: Increment k = k + 1 

3: Draw a sample ZW^- 1 ) , pC*' 1 ) ,S ~ IW (Z?^" 1 ) + n, S + 4>( k ~V I) 

4: Draw a sample ,£(*), S ~ (*^, 

5: Draw a sample /?( fe ) - 7r(/3< fc ) |S( fe ) , 0( fe ), S), see Algorithm 3 

6: Return to 2 except in the case of stop criterion 



Algorithm 2 Modification in Algorithm 1 for sampling from the joint posterior 
(11) 

4 bls : For j from 1 to p, draw asample ^ \f3 ( - k ' 1 \^ k \ S ~ 



We describe a random-walk Metropolis algorithm to sample from the distribu- 
tion of 7 fe when = log((3^ — p — 1). The proposed algorithm is a Markov 
chain with a Gaussian random- walk centered in the value ^ k ~^ as symmetric 
proposal density q at iteration k. 



" (fe-!) 

(j corresponds to the variance of 7 , estimated by numerical integration. 

/ / prop \ 

The acceptance probability is equal to min ( 1, ^y (fc _ 1) ^ 

Samples from (13) can be obtained using the mapping described in Algorithm 
3. 



5. Simulation results 

The objective of this section is to give a detailed comparison between dif- 
ferent estimators: the proposed Bayes estimators and the maximum likelihood 
estimator. More specifically, the study will consider the true matrices used in 
[3], with different structures and conditionings 1 . We will derive frequentist char- 
acteristics of the Bayesian procedures in terms of risks associated to three loss 



1 A matrix is said ill-conditioned if the ratio of its maximum and minimum eigenvalue is 
large. The closer it is to 1, the better conditioned the matrix is. 
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Algorithm 3 The random-walk Metropolis algorithm for the joint posterior 
(13) 



1: Set 7 ( fc - 1 ) = logO^- 1 ' - p - 1) 



2: Compute a'- 



- 9 (*-i) 



by numerical integration 



3: Sample 7^ from g( 7 P™P| 7 (fc-i) ; 2 x a 



-,(fc-i) 



4: Set log(p) = log g ( 7 P ro P) - logg( 7 ( fe - 1 )) 



5: Sample w from W[o ; i] 



6: If log/? > logw then -f^ = jP ro P. Otherwise 7 (fc ) = 7 ( fe ^ 
7: Set /3W = exp( 7 ( fe )) +p+l 



functions. Through this simulation study, we will get ideas about the behaviour 
of the competing estimation methods in different situations. 

5.1. Simulation design 

We carried out the simulation study from Section 3 of Daniels and Kass's 
paper [3]. They consider seven covariance matrices of dimension p = 5: three 
diagonal and four non-diagonal matrices. The first, A, is an identity matrix; the 
second, B, represents a covariance matrix with roughly equally spaced eigenval- 
ues increasing in powers of 0.75 from 0.75° to 0.75 4 ; the third, C, is a somewhat 
ill-conditionned matrix, with eigenvalues equal to 0.75°, 0.75 1 , 0.75 2 , 0.75 10 and 
0.75 20 . They are then combined with rotations to produce four full true covari- 
ance matrices: Bl and CI are matrix B and matrix C with Givens angles all 
set to f , B2 and C2 are matrixces B and C with Givens angles evenly spaced 
between (-|,f). 

From each covariance matrix, we do the following simulation process: 

• simulate a sample of size n, 

• compute Si iZ/1 , t, 2tLl , ^dk.l^ ^i,l 2 , ^2,l 2 and iW,L 2 respectively the 
estimators for S from Model 1 of Section 3.2 (with S = 2), from Model 2 of 
Section 3.3 (with S = 2), from Model D&K of Section 3.1 (with b = 10 6 ), 
under L\ and L 2 losses, 

• compute t<MLE, 



• compute the associated loss Lj(E,S) for i = 1,2 



S. 



and for each estimator 
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These estimations are carried out with the Metropolis-within-Gibbs algorithms 
described in Section 4.2 with 20,000 iterations, among which 5,000 are used for 
the burn-in period. 

We then compare the different estimators with respect to the risk function 
= Es i = 1,2. These frequentist risks are approximated 

by repeating 100 times the previous simulation process. 

From Propositions 1 and 2, S must be strictly greater than 1. For the remainder 
of this paper, we will choose, quite arbitrarily, the smallest possible integer for 
S, that is, 6 = 2. As an extension to this work, we could assign a prior on this 
parameter. 

5.2. Performance comparisons 

Here we proceed to comparison of risks - associated to a specific loss, L\ 
or L 2 - between the prior models. Then we compare the ability to accurately 
estimate the eigenvalues accross all competing estimators. 

5.2.1. Under the Frobenius loss 

Table 1 summarizes the simulation results for the frequentist risk R 2 for 
sample sizes n — 5 and n = 100. 

The main remark is that, for n — 5, the shrinkage estimator from Model 1 always 
leads to (sometimes dramatic: risk divided by 2) improvement in accuracy over 
the alternative procedures. Secondly the estimators from Model 2 and Model 
D&K fail to accurately estimate and do even worse than the maximum likelihood 
estimator in all cases. The models in question lead only to shrinkage of the 
off-diagonal elements and such regularization methods tend to not have the 
expected beneficial effect of being more precise, under L 2 loss. In contrast 
Model 1 involves more severe shrinkage, which is given a very positive welcome. 
For a bigger sample size, the differences in performance between the estimators 
disappear except for the identity-matrix case (A). 

5.2.2. Under Stein's loss 

Table 2 gives the simulation results for the frequentist risk Ri for sample 
sizes n = 5 and n = 100. 

When n=5, the Bayes estimators provide substantial improvement in risk com- 
pared to the sample covariance matrix for the well-conditioned matrices (A, 
B, Bl, B2), with risks from 3.78 to 8.7 times smaller. In these cases, Model 
1 does somewhat better than Model 2 and Model D&K but as soon as S is 
ill-conditioned (C, CI, C2), it does worse than all other estimators. For the ill- 
conditioned diagonal matrix (C), the estimators from Model 2 and Model D&K 
perform well compared to the sample covariance matrix while they do poorly 
for its two rotated versions of matrices (CI, C2). 

As expected, when the sample size becomes large (here n=100), the differences 
in performance between the estimators become blurred. Nevertheless in the 
well-conditioned cases the Bayes estimators still offer a non-negligible percent- 
age reduction in risk compared to the sample covariance matrix. 
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n = 


= 5 




True matrices 






^DK,L 2 




A 


0.88 (0.13) 


8.29 (0.71) 


7.09 (0.59) 


1.72 (0.12) 


B 


0.57 (0.06) 


3.87 (0.45) 


3.15 (0.39) 


0.83 (0.08) 


Bl 


0.55 (0.06) 


3.75 (0.33) 


3.22 (0.32) 


0.79 (0.07) 


B2 


0.39 (0.04) 


3.48 (0.30) 


2.77 (0.23) 


0.73 (0.06) 


C 


0.57 (0.05) 


2.87 (0.46) 


2.69 (0.78) 


0.62 (0.09) 


CI 


0.45 (0.05) 


2.39 (0.35) 


2.39 (0.34) 


0.60 (0.09) 


C2 


0.47 (0.06) 


2.07 (0.27) 


2.17 (0.28) 


0.64 (0.08) 






n = 


100 




A 


0.03 (0.004) 


0.11 (0.007) 


0.10 (0.007) 


0.09 (0.006) 


B 


0.04 (0.003) 


0.04 (0.004) 


0.04 (0.004) 


0.04 (0.003) 


Bl 


0.04 (0.003) 


0.04 (0.004) 


0.04 (0.004) 


0.04 (0.004) 


B2 


0.04 (0.003) 


0.05 (0.004) 


0.04 (0.004) 


0.04 (0.004) 


C 


0.04 (0.004) 


0.04 (0.004) 


0.04 (0.004) 


0.04 (0.004) 


CI 


0.03 (0.003) 


0.04 (0.004) 


0.04 (0.004) 


0.03 (0.003) 


C2 


0.03 (0.002) 


0.03 (0.002) 


0.03 (0.002) 


0.03 (0.002) 



Table 1: Comparison in risk between the Bayes estimators and the usual estimator of covari- 
ance matrix under L2. The values in parentheses refer to the simulation standard errors. 







n = 


= 5 




True matrices 


£l,Li 




^DK,Li 




A 


0.66 (0.06) 


1.42 (0.07) 


1.18 (0.07) 


5.75 (0.25) 


B 


0.87 (0.05) 


1.50 (0.09) 


1.26 (0.08) 


5.61 (0.21) 


Bl 


0.77 (0.05) 


1.45 (0.07) 


1.34 (0.07) 


5.77 (0.23) 


B2 


0.70 (0.04) 


1.47 (0.07) 


1.34 (0.06) 


5.52 (0.21) 


C 


42.42 (0.26) 


1.37 (0.08) 


1.17 (0.07) 


5.69 (0.22) 


CI 


41.87 (0.26) 


17.16 (0.97) 


26.72 (2.39) 


6.31 (0.32) 


C2 


42.77 (0.25) 


26.36 (1.97) 


42.04 (3.30) 


6.22 (0.25) 






n = 


100 




A 


0.03 (0.003) 


0.07 (0.003) 


0.06 (0.003) 


0.15 (0.006) 


B 


0.11 (0.004) 


0.07 (0.004) 


0.06 (0.004) 


0.15 (0.005) 


Bl 


0.12 (0.005) 


0.13 (0.005) 


0.13 (0.006) 


0.16 (0.006) 


B2 


0.13 (0.005) 


0.13 (0.005) 


0.14 (0.006) 


0.16 (0.006) 


C 


0.18 (0.007) 


0.07 (0.004) 


0.06 (0.003) 


0.15 (0.005) 


CI 


0.18 (0.007) 


0.16 (0.006) 


0.16 (0.006) 


0.16 (0.006) 


C2 


0.19 (0.006) 


0.17 (0.006) 


0.16 (0.006) 


0.16 (0.006) 



Table 2: Comparison between estimators with respect to the risk function Ri. The values in 
parentheses refer to the simulation standard errors. 
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s i,l 2 




^DK,L 2 


^MLE 


A 


0.48 (0.02) 


0.73 (0.02) 


0.67 (0.02) 


0.25 (0.02) 


0.45 (0.02) 


0.46 (0.02) 


0.97 (0.01) 


B 


0.35 (0.03) 


0.58 (0.02) 


0.50 (0.03) 


0.65 (0.06) 


0.43 (0.03) 


0.44 (0.03) 


0.95 (0.01) 


Bl 


0.29 (0.02) 


0.52 (0.02) 


0.45 (0.02) 


0.71 (0.05) 


0.50 (0.04) 


0.49 (0.04) 


0.95 (0.01) 


B2 


0.29 (0.02) 


0.54 (0.02) 


0.47 (0.02) 


0.67 (0.05) 


0.45 (0.04) 


0.42 (0.03) 


0.95 (0.01) 


C 


43.58 (2.62) 


0.43 (0.03) 


0.43 (0.04) 


93.67 (4.47) 


1.01 (0.09) 


0.98 (0.10) 


0.79 (0.02) 


CI 


43.15 (2.63) 


11.47 (0.74) 


13.96 (1.05) 


94.28 (4.27) 


36.87 (2.12) 


40.38 (2.45) 


0.86 (0.02) 


C2 


44.06 (2.50) 


19.26 (1.10) 


26.82 (1.89) 


97.61 (4.64) 


65.21 (3.48) 


76.81 (4.59) 


0.85 (0.02) 



Table 3: Comparison in risk under L\ min , between the smallest eigenvalue of the competing 
estimators when n = 5. The values in parentheses refer to the standard errors. 



5.2.3. About eigenvalues estimation 

An important issue in covariance matrix estimation is the bias of the estima- 
tors of the extreme eigenvalues. We can then compare the bias in the eigenvalues 
of estimates based on the different models and approaches and we try to en- 
lighten the previous results. 
We set 

Aj — Xi 

L\.(Xi, Xi) = , with i e {min,max} (14) 

Xi 

as a relative measure of distance between the true (Aj) and inferred eigenvalue 
(Ai)- 

Tables 4 and 3 give results for the minimal (A TO j„) and maximal (X max ) eigenval- 
ues of the seven estimators for each type of covariance matrix. The eigenvalues 
of T^mle differ greatly from the true values, especially when £ is close to the 
identity matrix. S 2 ,l i; S dk.Li an d £i,z, 2 appear to successfully estimate 

X m ax in all cases but fail in estimating X m i n in case of a strongly misspecified 
target. ^2,l 2 an d ^dk ,l 2 can improve X m i n but is always inappropriate for an 
accurate estimation of X max . 

These results, combined with the boxplots of the smallest and largest eigenval- 
ues, presented Annexe Appendix B, highlight that the Bayes methods reduce 
the distorsion of the eigenvalue spectrum but that this effect can be too pro- 
nounced. Moreover the results stress that the performances under L\ loss can be 
explained by the eigenvalues shrinkage: overskhrinkage of the small eigenvalues 
will imply a poor performance under L\ loss whereas it will not be penalized by 
L 2 loss which focuses on the errors on big values. 

Estimator derived from Model 1 will be forced to be well-conditioned. Conse- 
quently it will yield to overshrinkage as soon as the true matrix has its eigen- 
values far apart, but only of the smallest eigenvalues. 

The overestimation phenomenom can be explained by the lower bound of the 
hyperparameter (3 as illustrated in Section 5.2.4. 

5.2.4- Posterior distribution of (3 

As seen in Section 2, the hyperparameter f3 is known to control the amount 
of shrinkage. As the use of hierarchical models allows to estimate (3 from data, 
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£%L 2 


^DK,L 2 


£>MLE 


A 


0.27 (0.02) 


0.59 (0.04) 


0.64 (0.05) 


0.84 (05) 


2.26 (0.10) 


1.99 (0.10) 


1.70 (0.08) 


B 


0.34 (0.02) 


0.37 (0.03) 


0.40 (0.06) 


0.33 (0.03) 


1.21 (0.10) 


1.02 (0.09) 


0.68 (0.06) 


Bl 


0.29 (0.02) 


0.28 (0.02) 


0.31 (0.03) 


0.37 (0.04) 


1.24 (0.08) 


1.03 (0.08) 


0.84 (0.07) 


B2 


0.29 (0.02) 


0.27 (0.02) 


0.29 (0.02) 


0.30 (0.02) 


1.14 (0.07) 


0.94 (0.06) 


0.71 (0.05) 


C 


0.41 (0.02) 


0.33 (0.03) 


0.34 (0.04) 


0.38 (0.03) 


1.10 (0.09) 


0.96 (0.10) 


0.57 (0.05) 


CI 


0.38(0.02) 


0.31 (0.02) 


0.33 (0.03) 


0.42 (0.04) 


1.06 (0.08) 


1.02 (0.09) 


0.66 (0.06) 


C2 


0.37 (0.02) 


0.31 (0.02) 


0.34 (0.02) 


0.40 (0.04) 


0.93 (0.07) 


0.90 (0.08) 


0.67 (0.06) 



Table 4: Comparison in risk under L\ maa ., between the largest eigenvalue of the competing 
estimators when n = 5. The values in parentheses refer to the simulation standard errors. 



3 00 6 05 



Figure 1: Histogram of the posterior samples of /3 from (11) when n=100 in the full ill- 
conditioned case (C2). The red line represents the lower bound of f}. 



data will determine the shrinkage intensity. Therefore, if the structured scale 
matrix is close to the true matrix, then j3 would take high values. Inversely 
one expects to get values for (3 arbitrarily small in the case where data support 
structure far from the prior scale matrix configuration. However, as mentioned 
by Daniel and Kass [3], a low intensity for shrinkage will be impossible to obtain 
because j3 must be always bounded by p — 1 minimum. 

For illustration Figure 1 represents the histogram of posterior samples of f3 issued 
from Model 2 (a sample path of length 15,000 for one of the 100 datasets) when 
the true covariance matrix is C2 and n — 100. The posterior distribution of j3 
is very concentrated on the lower bound p+l. 
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6. Conclusion and discussion 

In this paper we proposed hierarchical Bayesian shrinkage methods for the 
estimation of covariance matrices in a small sample setting. Inverse- Wishart pri- 
ors were considered with unknown hyperparameters. We focused on a "diagonal, 
common variance" and two "diagonal, unequal variance" models as covariance 
targets. Models are with 2 to p+1 free parameters, on which we assigned non- 
informative priors. We showed in details the conditions to ensure the property 
of the posterior distributions and proposed a Metropolis-Hasting-within-Gibbs 
algorithm to sample from them. Then we gave a detailed comparison between 
the different Bayesian estimators and the classical maximum likelihood covari- 
ance estimator, under three loss functions. 

As statistically efficient and computationally fast alternative to the widely used 
standard covariance estimators, we recommend the shrinkage covariance esti- 
mators which shrink all components of the empirical covariance matrix, that is 
not only perfectly applicable to small samples but can also improve the classical 
estimators for large n. These improved estimators exhibit none of the defects 
of the standard covariance estimators, in particular they reduce variance, they 
are always positive definite and well-conditioned. This property might imply 
overestimation of the small eigenvalue. By producing a well-conditioned posi- 
tive definite covariance estimate one automatically also obtains an equally well- 
conditioned estimate of the inverse covariance - a quantity of crucial importance, 
for instance, in classification or graphical models. For other goals like reduc- 
tion dimension, principal component analysis needs to successfully estimate the 
largest eigenvalues rather than the smallest ones. However, for applications 
where we have to focus on estimation of all eigenvalues, care must be taken. In- 
deed under Stein's loss function, the proposed models can be unefficient. This 
happens when the data is in conflict with the specified prior structure, it is 
directly due to the lower bound of the degrees of freedom of an inverse- Wishart 
prior. Overshrinkage of the small eigenvalues is then caused. 
The evidence on differences in estimation performance of different estimation 
approaches and models suggest that there is no "best approach" and that the 
relative accuracy of one approach or model in comparison to another depends 
strongly on the problem, that's one of the basic principles that underlie the 
Bayes paradigm. 

A direct perspective of this work would be to investigate more complex models 
to overcome the issue of overshrinkage. Further work will be to investigate the 
effects of the estimation procedures in real data on the Value at Risk (VaR) 
computation. The VaR is a widely used tool for risk assessment in finance and 
is defined as a quantile of the predictive probability distribution for the amount 
of a future financial loss. The standard method for approximating the VaR 
is based on calculations using Monte-Carlo simulations of asset prices from a 
Gaussian distribution with unknown covariance matrix. The classical estimators 
have no full rank. Besides, in this application, focus is on the largest eigenvalues 
and then the proposed shrinkage estimator seems to be appropriate. 



15 



Acknowledgments 

We wish to thank Christian Robert and Jean-Michel Marin for their contri- 
bution. 

We would like to thank EDF (company of Electricity of France), especially 
Financial Direction, to support the thesis of Mathilde Bouriga. 



16 



References 



[1] Barnard, J., McCulloch, R., Meng, X., 2000. Modeling covariance matri- 
ces in terms of standard deviations and correlations, with applications to 
shrinkage. Statistica Sinica 10, 1281-1311. 

[2] Champion, C, 2001. Empirical Bayesian estimation of normal variances 
and covariances. Journal of Multivariate Analysis 26 (2), 60-79. 

[3] Daniels, M., Kass, R., 1999. Nonconjugate bayesian estimation of covari- 
ance matrices and its use in hierarchical models. Journal of the American 
Statistical Association 94 (448), 1254-1263. 

[4] Dempster, A., 1969. Elements of Continuous Multivariate Analysis. 
Addison- Wesley, Reading, Mass. 

[5] Dey, D., Srinivasan, C, 1985. Estimation of a covariance matrix under 
Stein's loss. The Annals of Statistics 13 (4), 1581-1591. 

[6] Diaconis, P., Ylvisaker, D., 1979. Conjugate priors for exponential families. 
The Annals of Statistics 7 (2), 269-281. 

[7] Eaton, M., 1983. Multivariate Statistics : A Vector Space Approach. Wiley. 

[8] Efron, B., Morris, C, 1974. Multivariate empirical bayes and estimation of 
covariance matrices. The Annals of Statistics 4, 22-32. 

[9] Franck, C, Zako'ian, J., 2000. Covariance matrix estimation for estimators 
of mixing weak arma models. Journal of statistical planning and inference 
83 (2), 369-394. 

[10] Furrer, R., Bengtsson, T., 2007. Estimation of high-dimensional prior and 
posteriori covariance matrices in kalman filter variants. Journal of Multi- 
variate Analysis 98, 227-255. 

[11] Gelman, A., 2006. Prior distributions for variance parameters in hierarchi- 
cal models. Bayesian Analysis 1 (3), 515-533. 

[12] Haff, L., 1979. Estimation of the inverse covariance matrix : random mix- 
tures of the inverse wishart matrix and the identity. The Annals of Statistics 
7 (6), 1264-1276. 

[13] Haff, L., 1980. Empirical bayes estimation of the multivariate normal co- 
variance matrix. The Annals of Statistics 8 (3), 586-597. 

[14] James, W., Stein, C, 1961. Estimation with quadratic loss. Proceedings of 
the Fourth Berkeley Symposium on Mathematical and Statistical Proba- 
bilities Vol. 1, University of California Press, Berkeley, 361-379. 

[15] Jeffreys, H., 1961. Theory of probability. 3rd ed. Oxford Classic Texts in 
the Physical Sciences, Oxford: Oxford University Press. 



17 



[16] Kubokawa, T., 2004. A revisit to estimation of the precision matrix of the 
Wishart distribution. Unpublished discussion paper. 

[17] Kubokawa, T., Srivastava, M., 1999. Estimating the covariance matrix : a 
new approach. Unpublished discussion paper. 

[18] Ledoit, O., Wolf, M., 2002. Improved estimation of the covariance matrix of 
stock returns with an application to protfolio selection. Journal of empirical 
finance 10, 603-621. 

[19] Leonard, T., Hsu, J., 1992. Bayesian inference for a covariance matrix. The 
Annals of Statistics 20 (4), 1669-1696. 

[20] Lin, S., Perlman, M., 1984. A monte carlo comparison of four estimators 
of covariance matrix. Tech. Rep. 44, University of Washington, Dept. of 
statistics. 

[21] Robert, C, 2001. The Bayesian Choice: from Decision-Theoretic Motiva- 
tions to Computational Implementation. Springer. 

[22] Robert, C, Marin, J., 2007. Bayesian core. Springer- Verlag, New York. 

[23] Roberts, C, Rosenthal, J., 2006. Harris recurrence of metropolis-within- 
gibbs and trans-dimensional markov chains. The Annals of Applied Statis- 
tics 16 (4), 2123-2139. 

[24] Schafer, J., Strimmer, K., 2005. A shrinkage approach to large-scale covari- 
ance matrix estimation and implications for functional genomics. Statistical 
applications in genetic and molecular biology 4 (32). 

[25] Smith, M., Kohn, R., 2002. Parsimonious covariance matrix estimation for 
longitudinal data. Journal of the American Statistical Association 97 (460), 
1140-1153. 

[26] Stein, C, 1956. Some problems in multivariate analysis. Tech. Rep. 6, 
Standford University, Dept. of Statistics. 

[27] Stein, C, 1975. Estimation of a covariance matrix. Rietz lecture, 39th an- 
nual meeting IMS. Atlanta, Georgia. 

[28] Yang, R., Berger, J., 1994. Estimation of a covariance matrix using the 
reference prior. The Annals of Statistics 22 (3), 1195-1211. 



18 



Appendix A. Proof of Proposition 2 

We want to evaluate whether 

f + OO 



/-t-OO !■ !■ 

/ / / 7r(E, $, (3\S)dT,d(j)i...d<f)pdf3 (A.l) 

Jp+l J]0;+oo[p JS+ 

is finite or not. 

By Fubini's theorem: 

/ / / 7r(S,$, J 8|5)rfE# 1 ...#pd/3= / ir(0\S)dp. (A.2) 

Thus it is sufficient to prove the convergence of the posterior marginal distribu- 
tion of j3. 

We begin to marginalize the density n(T,,$ 7 (3\S) over £ to obtain an expression 
for tt($,/3|S). We have: 

7r($,/3|5)oc pl 2 j l$l 



r p (f) |5 + $|^^- 

Denote QAQ T , the spectral decomposition of the matrix S. Then 



tt($,/3|S) cx 



r f ( -2> \QAQ T +Q<S>Q T \^- P 



3C 



CX 



cx 



r„(^) ffll" 1 



1 p(s) lO(A+®)0 T l^- ' 



r p (g±^) |<&| # - 1 i 

r p(Tf) (|QQ T ||A+$|)T _ P 

rp(^) mu^f' 1 1 
nr =1 (A ( +w^^' 



Hence ^|fif) cx Jj 0;+co[p E^_JL^_ ^ ^ #i . 

We just need to find a density #(/?) which dominates the positive function tt((3\S) 
on ]p + 1; +oo[ and define the conditions so that the latter converges. 
Let A m i„ be the smallest eigenvalue of S, it follows: 

APIS) < r(h tv a l^^ ^n^ ^ e]P + i;+oo[ 

= tt(/3|S = A min J). (A.3) 
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Let us check if tt(/3\S — A„ lin 7) is an integrable function over ]p + 1; +oo[. 
We have 



oc 



oc 



r p(^) 1 TIP f+°° / tl JLrlA, 

r as lli = iJ \ \\ mi „ lo „ , , >t 3 +" 0, ^ 



In each single integral over (/> , , the density of a Fisher F-distribution with (3 and 
n degrees of freedom appears for the random variables . ,. . It implies: 



and thus leads to 



r P (^)r ( frr( t r i 



r F (|) r(^) P 



From the integration properties for positive functions, the inequality () implies 
the same inequality for their integrals. Hence 

/t_|_QQ /t_|_QQ n P 

/ tt(0\S) < / 7r(^,...,<^,/3|S = A roi „J) 17 

Jp+l Jp+l J]0;+co[P i=1 

This function is well-defined in p + 1. Let us see the behaviour in +oo. 

By definition T p (f3) — it P<P 4 " n^=i + ^^)- Furthermore Stirling's formula 
provides an approximation for the Gamma function: r(/3) ~ exp~^ /3^ _1 / 2 (27r) 1 / 2 . 

/3->+oo 

In consequence we get 
r p (/3) ~ . £i£ ^2^exp(E^ 1 ^ i )exp(- P /?)n^ 1 (/3+ V)^ 4 - 
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Hence 



Moreover 



As a result 



r P (f) nLr(^; 



llj = l J -V 2 

2 )llj=l\ 2 



exp(- P ^)m= 1 (^ 1±i )^ 



^-+00 exp(-pf)n^ =1 (^ ±i 



r(f) exp(-f)(f)^ 



r( _) ^+oo exp( _(^)^^ 2 

exp(f) 

0-> + OO ' 

v 1 7 r p (f ) V r (^) / 13 

^ exp(f ) 1 



0-+oo 2W2 exp (2f) /3 a 



2 ' V2' 

1 



Thus, by (A. 4), ir(j3\S) is integrable as soon as 6 > 1. Together with (A. 2), the 
same conclusion holds for 7r(£, (j>, f3\S) so that (A.l) is finite. 

Appendix B. Complement for simulation results 

These boxplots show the shrinkage effects on the estimates for the extreme 
eigenvalues. 
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M1.L1 M2.L1 DK.L1 M1 ,L2 M2.L2 DK.L2 ML Ml ,1.1 M2.L1 DK.L1 M1 ,L2 M2.L2 DK.L2 ML 




M1.L1 M2.L1 DK.L1 M1 ,L2 M2.L2 DK.L2 ML M1 ,L1 M2.L1 DK.L1 M1 ,L2 M2.L2 DK.L2 ML 




M1.L1 M2.L1 DK.L1 M1 ,L2 M2.L2 DK.L2 ML 



M1.L1 M2.L1 DK.L1 M1 ,L2 M2.L2 DK.L2 



Figure B.2: Boxplots of the smallest (left) and largest (right) eigenvalues estimates when the 
true matrices are diagonal. The horizontal lines represent the true values of the eigenvalues. 
Top-to-bottom: A, the identity matrix - B, the well-conditioned diagonal matrix - C, the 
ill-conditioned diagonal matrix. 
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Figure B.3: Boxplots of the smallest (left) and largest (right) eigenvalues estimates when 
the true matrices are full. The horizontal lines represent the true values of the eigenvalues. 
Top-to-bottom: B2, the well-conditioned full matrix - C2, the ill-conditioned full matrix. 
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